Day08 - HDFS Basic Operations (FileSystem Shell & Python) - iT 邦幫忙

2023 iThome Ironman Contest

DAY 8

AI & Data

Getting to Know the Mainstream Big Data Frameworks in 30 Days: Hadoop + Spark + Flink, Part 8 of the series

15th Ironman Contest

RiceBugJ

2023-09-23 00:32:48

Today we look at basic HDFS operations: writing, reading, and deleting files. Besides the basic shell commands, we will also see how to operate HDFS from Python.

Code
All the code for this series lives in Big-Data-Framework-30-days. I recommend cloning the whole repo, following the README for the basic setup, and then cd-ing into today's folder.

Starting and Stopping

Start DFS before running any of the operations below, and remember to stop it when you are done.

# Start DFS
$HADOOP_HOME/sbin/start-dfs.sh
# Stop DFS
$HADOOP_HOME/sbin/stop-dfs.sh

FileSystem Shell

The File System (FS) shell provides many commands that resemble ordinary shell commands. They interact with HDFS or any other file system Hadoop supports (such as Local FS, WebHDFS, and S3 FS). Usage:

hadoop fs <args> 
# or
hdfs dfs <args>
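
If you want to drive the FS shell from a script, the same commands can be launched from Python. The sketch below is only an illustration: `fs_command` and `run_fs` are hypothetical helper names, and `run_fs` assumes the `hadoop` (or `hdfs`) executable is on your `PATH` and DFS is running.

```python
import subprocess


def fs_command(operation, *args, executable="hadoop"):
    """Build the argument list for `hadoop fs -<operation> <args...>`.

    Pass executable="hdfs" to get the equivalent `hdfs dfs ...` form.
    """
    subcommand = "dfs" if executable == "hdfs" else "fs"
    return [executable, subcommand, f"-{operation}", *args]


def run_fs(operation, *args, executable="hadoop"):
    """Run the command and return its stdout (raises on a non-zero exit code)."""
    result = subprocess.run(
        fs_command(operation, *args, executable=executable),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Building a command does not need a cluster:
print(fs_command("ls", "day08"))                     # ['hadoop', 'fs', '-ls', 'day08']
print(fs_command("ls", "day08", executable="hdfs"))  # ['hdfs', 'dfs', '-ls', 'day08']
```

With a running cluster, `run_fs("ls", "day08")` would return the same listing that `hadoop fs -ls day08` prints.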

1. Write

Create an HDFS directory
```
hadoop fs -mkdir day08
```
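
Note that `day08` is a relative path. HDFS resolves relative paths against the user's home directory, which is `/user/<username>` with default settings. Here is a small sketch of that rule; the helper name `resolve_hdfs_path` and the user name `alice` are made up for illustration, and the `/user` prefix is an assumption about your configuration.

```python
import posixpath


def resolve_hdfs_path(path, user, home_prefix="/user"):
    """Return the absolute HDFS path that a relative or absolute path refers to.

    Assumes the default home directory layout /user/<username>.
    """
    if path.startswith("/"):
        return posixpath.normpath(path)
    return posixpath.normpath(posixpath.join(home_prefix, user, path))


print(resolve_hdfs_path("day08", "alice"))       # /user/alice/day08
print(resolve_hdfs_path("/tmp/day08", "alice"))  # /tmp/day08
```

If the home directory does not exist yet, `hadoop fs -mkdir -p` creates the missing parent directories as well.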

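Store a file/directory (local >> HDFS)

# Create a test file
echo "Write into HDFS" > test_writing.txt
# put usage: hadoop fs -put <local/file_or_dir> <HDFS/file_or_dir>
hadoop fs -put test_writing.txt day08

Running the same put again fails with an error because the target file already exists; add the -f option to overwrite it.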

List a directory
```
hadoop fs -ls day08
```
The file we just wrote now appears in the directory listing.

2. Read

Print a file
```
hadoop fs -cat day08/test_writing.txt
```
This prints the file content, Write into HDFS.

Fetch a file/directory (HDFS >> local)
```
# get usage: hadoop fs -get <HDFS/file_or_dir> <local/file_or_dir>
hadoop fs -get day08/test_writing.txt get_test_writing.txt
```
(test_writing.txt already exists in the current local directory, so the downloaded copy is saved under the new name get_test_writing.txt.)

3. Delete

Delete a file
```
hadoop fs -rm day08/test_writing.txt
```
Recursively delete a directory and its contents
```
hadoop fs -rm -r day08
```

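4. Update?

From a CRUD point of view, Update is missing from the list above. That is because HDFS cannot modify existing file content in place. To update a file, you usually write a new version and delete the old one.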
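
The "write new + delete old" pattern can be sketched in Python as follows. This is only one possible approach, not an official recipe: `replace_file` and the `.tmp` suffix are hypothetical, and `client` is assumed to offer `write`, `delete`, and `rename` like the client of the `hdfs` package introduced in the Python section below. `FakeClient` is an in-memory stand-in so the sketch runs without a cluster.

```python
def replace_file(client, hdfs_path, data, encoding="utf-8"):
    """'Update' a file by writing a new copy and swapping it in."""
    tmp_path = hdfs_path + ".tmp"       # hypothetical temp-name convention
    client.write(tmp_path, data=data, overwrite=True, encoding=encoding)
    client.delete(hdfs_path)            # remove the old version, if any
    client.rename(tmp_path, hdfs_path)  # move the new version into place


class FakeClient:
    """In-memory stand-in for an HDFS client (illustration only)."""

    def __init__(self):
        self.files = {}

    def write(self, hdfs_path, data, overwrite=False, encoding=None):
        if hdfs_path in self.files and not overwrite:
            raise FileExistsError(hdfs_path)
        self.files[hdfs_path] = data

    def delete(self, hdfs_path):
        # True if something was deleted, False if the path did not exist
        return self.files.pop(hdfs_path, None) is not None

    def rename(self, src, dst):
        self.files[dst] = self.files.pop(src)


fake = FakeClient()
fake.write("day08/test_writing.txt", data="Write into HDFS")
replace_file(fake, "day08/test_writing.txt", "Updated content")
print(fake.files)  # {'day08/test_writing.txt': 'Updated content'}
```

Writing to a temporary path first means the old file is only removed once the new content has been written successfully.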

Python

Next we repeat the operations above with the Python API. All the code is on my GitHub.

1. Install the package

pip install hdfs

2. Connect (NameNode)

# make connections
from hdfs import InsecureClient
client = InsecureClient("http://localhost:9870/", user='mengchiehliu')

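If user is not specified, it defaults to the operating system user.
The NameNode's default HTTP port is 9870 (in Hadoop 3.x). Open http://localhost:9870 in a browser to check that the NameNode Web UI appears; if it does not, start HDFS first. To use a different port, add the following configuration to etc/hadoop/hdfs-site.xml: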
```
<property>
  <name>dfs.namenode.http-address</name>
  <value>localhost:{{port}}</value>
</property>
```
InsecureClient only works when HDFS security is turned off. In a secured environment, use TokenClient instead. Both are subclasses of Client; prefer one of the subclasses over connecting with Client directly.
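
If you changed the port as described above, the URL passed to InsecureClient has to change with it. Below is a minimal sketch that reads the address out of the hdfs-site.xml text instead of hard-coding it. The helper name `namenode_url` and the port 9871 in the sample are made up for illustration, and the fallback `localhost:9870` is simply the address used in this post.

```python
import xml.etree.ElementTree as ET

DEFAULT_HTTP_ADDRESS = "localhost:9870"  # address used in this post


def namenode_url(hdfs_site_xml, default=DEFAULT_HTTP_ADDRESS):
    """Return the NameNode HTTP URL from the text of an hdfs-site.xml file."""
    root = ET.fromstring(hdfs_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "dfs.namenode.http-address":
            return "http://" + prop.findtext("value", default).strip()
    return "http://" + default  # property not set: fall back to the default


sample = """
<configuration>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>localhost:9871</value>
  </property>
</configuration>
"""
print(namenode_url(sample))              # http://localhost:9871
print(namenode_url("<configuration/>"))  # http://localhost:9870
```

The returned string can then be passed as the first argument of `InsecureClient(...)`.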

3. Write

Create an HDFS directory
```
client.makedirs('day08')
```

Store a file/directory (local >> HDFS)

import os
local_path = os.path.join(os.path.dirname(__file__), "test_writing.txt")
return_path = client.upload(hdfs_path="day08/test_writing.txt", local_path=local_path, overwrite=True)
print("hdfs_path returned on success:", return_path)

List a directory

dirs = client.list('day08')
print("Directory contents:", dirs)

4. Read

Print a file

# Without an encoding argument, reader.read() returns bytes
with client.read('day08/test_writing.txt') as reader:
    content = reader.read()
    print('File content:', content)
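
One thing to watch for: `client.read` yields raw bytes unless you pass `encoding`. The sketch below wraps that in a hypothetical helper, `read_text`, to get a `str` back. `FakeClient` is an in-memory stand-in that only imitates this bytes/str behaviour so the example runs without a cluster; it is not part of the `hdfs` package.

```python
import io
from contextlib import contextmanager


def read_text(client, hdfs_path, encoding="utf-8"):
    """Return the file's content as str instead of bytes."""
    with client.read(hdfs_path, encoding=encoding) as reader:
        return reader.read()


class FakeClient:
    """In-memory stand-in imitating the bytes/str behaviour of read()."""

    def __init__(self, files):
        self.files = files  # maps path -> bytes

    @contextmanager
    def read(self, hdfs_path, encoding=None):
        raw = io.BytesIO(self.files[hdfs_path])
        # With an encoding, hand out a text reader; otherwise the raw bytes reader
        yield io.TextIOWrapper(raw, encoding=encoding) if encoding else raw


fake = FakeClient({"day08/test_writing.txt": b"Write into HDFS\n"})
with fake.read("day08/test_writing.txt") as reader:
    raw_content = reader.read()
print(repr(raw_content))                                # b'Write into HDFS\n'
print(repr(read_text(fake, "day08/test_writing.txt")))  # 'Write into HDFS\n'
```

With the real client, `read_text(client, 'day08/test_writing.txt')` would be called the same way.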

Fetch a file/directory (HDFS >> local)

new_local_path = os.path.join(os.path.dirname(__file__), "get_test_writing.txt")
return_path = client.download(hdfs_path="day08/test_writing.txt", local_path=new_local_path, overwrite=True)
print("local_path returned on success:", return_path)

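5. Delete

Delete a file

print('File deleted?', client.delete('day08/test_writing.txt'))

Recursively delete a directory and its contents

print('Directory and contents deleted?', client.delete('day08', recursive=True))

6. Other operations

I won't demonstrate these here; feel free to try them yourself.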

# Retrieving a file or folder content summary.
content = client.content('<hdfs/path>')

# Retrieving a file or folder status.
status = client.status('<hdfs/path>')

# Renaming ("moving") a file.
client.rename('<hdfs/path>', '<hdfs/new/path>')
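
`client.status` returns a plain dict that mirrors the WebHDFS FileStatus JSON. As a rough sketch of how you might use it, the hypothetical helper `describe_status` below turns such a dict into a one-line summary. The values in `sample_status` are made up; a real call would return more keys.

```python
from datetime import datetime, timezone


def describe_status(status):
    """Summarise a FileStatus dict such as the one returned by client.status()."""
    # modificationTime is in milliseconds since the Unix epoch
    modified = datetime.fromtimestamp(status["modificationTime"] / 1000, tz=timezone.utc)
    return "{type} {permission} {owner} {length} bytes, modified {when}".format(
        type=status["type"],
        permission=status["permission"],
        owner=status["owner"],
        length=status["length"],
        when=modified.strftime("%Y-%m-%d %H:%M:%S UTC"),
    )


# Made-up sample values for illustration only
sample_status = {
    "type": "FILE",
    "permission": "644",
    "owner": "alice",
    "length": 16,
    "modificationTime": 1695400368000,
}
print(describe_status(sample_status))
# FILE 644 alice 16 bytes, modified 2023-09-22 16:32:48 UTC
```

With a live connection you would call `describe_status(client.status('<hdfs/path>'))` instead.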